Abstract



In constrained adaptive testing, the number of constraints needed to control the content of the
tests can easily run into the hundreds. Proper initialization of the algorithm becomes a 
requirement because the presence of large numbers of constraints slows down the 
convergence of the ability estimator. In this paper, an empirical initialization of the algorithm 
is proposed based on the statistical relation between the ability variable and background 
variables known prior to the test. The relation is modeled using a two-parameter logistic 
version of an IRT model with manifest predictors discussed in Zwinderman (1991). An 
empirical example shows how an (incomplete) sample of response data and data on 
background variables can be used to derive an initial ability estimate or an empirical prior 
distribution for the ability parameter. 
A Procedure for Empirical Initialization of
Adaptive Testing Algorithms 



Item response theory (IRT) models the probability of a response to a test item as the 
result of an interaction between the properties of the item and the ability of the examinee. 
Typically, this interaction is mapped on a parameter structure for the probability function for 
the response with separate parameters for the examinee and the item. One of the main 
advantages of separate parameterization of examinees and items is that it is possible to select 
items to match the abilities of examinees. If a conventional linear test has to be assembled, a 
standard approach is to set a target for the information function of the test with optimal values 
for the part of the ability scale where the examinees are expected to be and select a 
combination of items that meets the target best in some sense (Birnbaum, 1968). A more
powerful application of the principle is found in computerized adaptive testing (CAT) where 
each individual item in the test is selected to match the current estimate of the ability of the 
examinee. A popular implementation of the principle of adaptive testing calculates the 
maximum-likelihood (ML) estimate of ability from the updated response vector of the 
examinee and selects the next item to have maximum statistical information at the estimate. 
An alternative Bayesian procedure is to select the items to optimize the posterior distribution 
of the ability parameter. 

Under general conditions, both the MLE and the Bayesian estimator defined on the 
posterior ability distribution are known to converge to the true ability (Chang & Ying, 1996; 
Gelman, Carlin, Stern, & Rubin, 1995, Appendix B). The speed of convergence of the
algorithm depends on its initialization. Generally, the farther the initial
ability estimate or prior distribution is from the true ability of the examinee, the slower the
algorithm converges to an estimate with prescribed precision. On the other hand, a perfect
initialization does not imply an immediate stop of the algorithm. IRT models define a 
stochastic relation between the ability and the responses, and even with a perfect start a CAT 
algorithm needs some time to accept the true ability value with enough certainty. 

In constrained adaptive testing the objective is to select items from the pool to 
maximize the statistical precision of the ability estimator subject to constraints on item or test 
attributes, or constraints needed to deal with a possible item-set structure or to control item 
exposure. For a realistic testing program, the number of constraints can easily run into the 
hundreds (van der Linden & Reese, 1997). Generally, the presence of such constraints slows 
down the convergence of the ability estimator, reinforcing the need for optimal initialization of
the algorithm.

An obvious way to improve initialization of a CAT procedure is to use empirical 
predictors of the ability of the examinee. These predictors may be available in the form of 
information on background variables at hand when administering the test. For example, to 
register for a CAT session, examinees usually have to fill out a form with biographical
information, and a useful statistical relation may be present between the data on the form and
the abilities of the examinees. Also, most CAT sessions begin with the examinees reading the
instructions for the test and responding to a few exercises before the actual test starts. The time
needed to work through the instructions and/or the responses given to the exercises may
already contain statistical information on the ability of the examinee. As a final example, 
knowledge of scores on previous attempts to pass the same test, possibly in combination with 
the amount of time elapsed since the last attempt and/or information on intermediate coaching 
could be used to predict the ability of the examinee. 

The point of view that additional sources of information on ability available at the time 
of testing should be exploited is certainly not new. The principle forms a standard belief in 
Bayesian statistics. However, application of the principle has been inhibited by the perception 
that in testing the abilities of the examinees should "speak for themselves" and that it may be 
unfair to let (possibly unfavorable) background information creep into test scores. This
concern is unfounded because it does not make the important distinction between the
experiment of selecting the items and the one of generating responses to the items once they 
are selected. The key question is whether the former can be ignored when estimating the 
ability of the examinee from the responses obtained through the latter (for a formal definition 
of ignorability, see Little and Rubin, 1987). As shown in Mislevy and Wu (1988), in adaptive 
testing the item selection mechanism can be ignored under maximum-likelihood estimation if 
the interest is in the value of the ability estimate and not in inferences with respect to the 
sampling distribution of the estimator. Bayesian inference is legitimate provided the 
knowledge of the background variables has been incorporated into the prior. More formally, 
the condition means that if data on P background variables $X_p$, p=0,...,P, with a statistical
relation to the ability variable $\theta$ have been inspected by the statistician, the conditional
distribution of $\theta$ given $X_1=x_1,\ldots,X_P=x_P$ is the correct prior in Bayesian inference. This paper
addresses the question of how to obtain an empirical estimate of the prior from data on the
background variables $X_p$. The estimate can be used to initialize an empirical Bayes algorithm
in adaptive testing. Also, the mean of the prior provides an initial point estimate of the ability
of the examinee and can be used to select the first item if the interest is in point estimation of
ability, for example, in combination with maximum-information item selection.

Models and Procedures 

It is assumed that the probability of a successful response to item $i=1,\ldots,I$ in the pool
can be described by the 2-parameter logistic (2-PL) model:

$$P_i(\theta) = \mathrm{Prob}\{U_i = 1 \mid \theta\} = \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}, \tag{1}$$

where $\theta \in (-\infty,\infty)$ is the parameter representing the ability of the examinee, and $b_i \in (-\infty,\infty)$
and $a_i \in [0,\infty)$ are the parameters for the difficulty and discrimination of the item, respectively.
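As an illustration, the response function in (1) can be evaluated as in the following minimal sketch. The function name `p_2pl` and the numerically stable branching are choices made for this example and are not part of the original text.

```python
import math


def p_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2-PL model in (1).

    theta: ability, a: item discrimination, b: item difficulty.
    (Illustrative helper; the name is hypothetical.)
    """
    z = a * (theta - b)
    # Evaluate the logistic function so that exp() is only ever called
    # with a non-positive argument, which avoids overflow.
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

For an examinee whose ability equals the item difficulty, the function returns 0.5 regardless of the discrimination parameter.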

Further, it is assumed that the P predictor or background variables, $X_p$, p=0,...,P, have
the following statistical relation to the ability parameter:

$$\theta = \beta_0 + \beta_1 x_1 + \ldots + \beta_P x_P + \varepsilon, \tag{2}$$

with error term $\varepsilon$ distributed as

$$\varepsilon \sim N(0, \sigma^2). \tag{3}$$

Recall that the model in (2) only has to be linear in the parameters $\beta_p$. The model therefore
covers the wide class of relations that can be brought into linear form by a monotonic 
transformation of the original predictor variables (for some examples, see Neter, Wasserman & 
Kutner, 1990, chap. 4). Also, the variables in (2) can be chosen to represent higher-order 
predictors in an original polynomial model. Generally, the assumption of normality in (3) is a 
better approximation to reality, the larger the selection of predictor variables in (2). 
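The point that (2) only needs to be linear in the parameters can be illustrated with a small sketch. The two predictors used here (a response time and an age), the log transformation, and the quadratic term are hypothetical choices for this example only; the coefficient values are arbitrary.

```python
import math


def design_row(response_time, age):
    """Build one row (x_0, ..., x_P) of predictors for the model in (2).

    Both arguments are hypothetical background variables. The row contains
    a constant, a log-transformed predictor, and a linear and a quadratic
    term of a second predictor; the model stays linear in the betas.
    """
    return [1.0, math.log(response_time), age, age ** 2]


def predicted_ability(row, beta):
    """Linear predictor beta_0 * x_0 + ... + beta_P * x_P from (2)."""
    return sum(b * x for b, x in zip(beta, row))
```

Any transformation applied inside `design_row` leaves the estimation problem unchanged, because the parameters still enter only through a weighted sum.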

From (2)-(3) it follows that

$$\theta \mid X_1 = x_1, \ldots, X_P = x_P \sim N(\beta_0 + \beta_1 x_1 + \ldots + \beta_P x_P, \sigma^2) \tag{4}$$

is the empirical distribution of the ability of an examinee randomly sampled from a
subpopulation with $X_1=x_1,\ldots,X_P=x_P$. Before introducing the distribution in (4) as an empirical
prior in adaptive testing, the adaptive procedures considered in this paper will be introduced.

The items in the pool are denoted by index $i=1,\ldots,I$, and index $k=1,\ldots,n$ is used to
represent the position of the items in the test. Thus, $i_k$ is the index of the item in the pool
administered as the kth item in the test. The set $S_{k-1} = \{i_1,\ldots,i_{k-1}\}$ is used to denote the first k-1
items in the test. The set of remaining items in the pool is denoted as $R_k = \{1,\ldots,I\} \setminus S_{k-1}$.
Finally, $\hat\theta^{(k-1)}$ represents the estimator of $\theta$ after k-1 items have been administered.

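In adaptive testing with the maximum-information criterion, $\hat\theta^{(k-1)}$ is chosen to be the
ML estimate of $\theta$ and the next item is selected according to

$$i_k = \max_h \{ I_h(\hat\theta^{(k-1)});\ h \in R_k \}, \tag{5}$$

where $I_h(\theta)$ is defined as Fisher's information in $U_h$ on $\theta$. In a Bayesian approach, the response
$U_{i_{k-1}} = u_{i_{k-1}}$ is used to update the posterior distribution of $\theta$ by an application of Bayes theorem:

$$f(\theta \mid u_{i_1},\ldots,u_{i_{k-1}}) = \frac{f(u_{i_{k-1}} \mid \theta)\, f(\theta \mid u_{i_1},\ldots,u_{i_{k-2}})}{\int f(u_{i_{k-1}} \mid \theta)\, f(\theta \mid u_{i_1},\ldots,u_{i_{k-2}})\, d\theta}, \tag{6}$$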



where $f(u_{i_{k-1}} \mid \theta)$ is the probability function of $U_{i_{k-1}} = u_{i_{k-1}}$ defined by (1). Several Bayesian item
selection criteria are possible. For example, a well-known practice is to choose $\hat\theta^{(k-1)}$ to be the
expected a posteriori (EAP) estimator, which is the expected value of $\theta$ over the distribution in
(6), and use this estimator to select the item with maximum information. Another example is the
minimum expected posterior variance criterion, which for each remaining item in the pool,
$h \in R_k$, predicts the posterior variance of the ability estimator both after a correct and an
incorrect response and selects the item with minimum expected variance over the responses;
that is,

$$i_k \equiv \min_h \{ [1 - P_h(\hat\theta^{(k-1)})]\, \mathrm{Var}(\theta \mid u_{i_1},\ldots,u_{i_{k-1}}, U_h = 0) + P_h(\hat\theta^{(k-1)})\, \mathrm{Var}(\theta \mid u_{i_1},\ldots,u_{i_{k-1}}, U_h = 1);\ h \in R_k \}. \tag{7}$$



Other examples of Bayesian item selection criteria are given in van der Linden (1996). 
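The two selection rules in (5) and (7) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the procedure used in the paper: the posterior is represented by weights on a fixed grid of ability values, the information function of a 2-PL item is taken to be $a_h^2 P_h(\theta)[1 - P_h(\theta)]$, and all function names are hypothetical.

```python
import numpy as np


def p_2pl(theta, a, b):
    """2-PL response probability from (1); works on scalars and arrays."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def select_max_info(theta_hat, a, b, remaining):
    """Criterion (5): remaining item with maximum information at theta_hat."""
    p = p_2pl(theta_hat, a[remaining], b[remaining])
    info = a[remaining] ** 2 * p * (1.0 - p)
    return remaining[int(np.argmax(info))]


def posterior_var(grid, dens):
    """Variance of a (possibly unnormalized) density given on a grid."""
    w = dens / dens.sum()
    mean = np.sum(w * grid)
    return np.sum(w * (grid - mean) ** 2)


def select_min_expected_var(grid, post, a, b, remaining):
    """Criterion (7): remaining item with minimum expected posterior variance."""
    w = post / post.sum()
    theta_hat = np.sum(w * grid)  # EAP estimate from the current posterior
    best_item, best_value = None, np.inf
    for h in remaining:
        p_grid = p_2pl(grid, a[h], b[h])
        p_hat = p_2pl(theta_hat, a[h], b[h])
        var_correct = posterior_var(grid, post * p_grid)
        var_incorrect = posterior_var(grid, post * (1.0 - p_grid))
        expected = (1.0 - p_hat) * var_incorrect + p_hat * var_correct
        if expected < best_value:
            best_item, best_value = h, expected
    return best_item
```

Here `remaining` is an integer array holding the indices in $R_k$, and `post` holds the current posterior evaluated on `grid`.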
Initialization of the Algorithm

Both the maximum-information and Bayesian item selection criteria need proper 
initialization to accelerate the convergence of the algorithm. If ML estimation is used, the 
initial item often is chosen to be informative near the middle of the ability distribution 
expected for the population of examinees, but this choice may be suboptimal for a substantial 
portion of the examinees. In a Bayesian procedure, a typical choice of the prior distribution is 
a flat prior or a normal prior with large variance located at the middle of the expected ability 
distribution. The former ignores possible information available about the examinee and the 
latter may also be suboptimal for a substantial portion of the examinees. 

In this paper, an empirical estimate of (4) is proposed as the prior in the adaptive procedure;
that is, the density $f(\theta \mid x_1,\ldots,x_P)$ belonging to the normal distribution in (4) is used to
initialize the procedure. This prior defines the following EAP estimate:

$$\hat\theta^{(0)} = \beta_0 + \beta_1 x_1 + \ldots + \beta_P x_P, \tag{8}$$

which can be used as an initial point estimate for the ability parameter in the maximum-information
criterion. In a full Bayesian procedure, $f(\theta \mid x_1,\ldots,x_P)$ can be used as an empirical
prior, yielding as the first posterior:

$$f(\theta \mid u_h, x_1,\ldots,x_P) = \frac{f(u_h \mid \theta)\, f(\theta \mid x_1,\ldots,x_P)}{\int f(u_h \mid \theta)\, f(\theta \mid x_1,\ldots,x_P)\, d\theta}, \tag{9}$$

where $f(u_h \mid \theta, x_1,\ldots,x_P) = f(u_h \mid \theta)$ due to conditional independence of $U_h$ and $X_1,\ldots,X_P$ given $\theta$.
To implement (8), the values of the parameters $\beta_0,\ldots,\beta_P$ should be known. To
implement (9), in addition the value of the parameter $\sigma^2$ is needed. This value can be
interpreted as an empirical measure of the prior uncertainty about $\theta$. A method for estimating
the values of these parameters is discussed in the following section.
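The initialization in (8) and the first update in (9) can be sketched on a grid of ability values. The grid representation and the function names are assumptions made for this illustration; the regression parameters are taken as given.

```python
import numpy as np


def empirical_prior(x, beta, sigma2, grid):
    """Initial estimate (8) and the normal prior (4) as weights on a grid.

    x      : predictor values x_1, ..., x_P (without the constant)
    beta   : (beta_0, ..., beta_P), with beta[0] the intercept
    sigma2 : prior variance
    """
    mean = beta[0] + np.dot(beta[1:], x)
    dens = np.exp(-(grid - mean) ** 2 / (2.0 * sigma2))
    return mean, dens / dens.sum()


def update_posterior(prior, grid, u, a, b):
    """First posterior (9) after response u (0 or 1) to a 2-PL item (a, b)."""
    p = 1.0 / (1.0 + np.exp(-a * (grid - b)))
    likelihood = p if u == 1 else 1.0 - p
    post = prior * likelihood
    return post / post.sum()
```

The returned `mean` can serve as the starting value for maximum-information selection, whereas the grid weights can be passed on to a Bayesian criterion.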

Estimation of Regression Parameters

A seemingly straightforward approach to estimating the regression parameters in (2)
would be to regress $\hat\theta$ on the predictors $X_1,\ldots,X_P$ using the least-squares criterion.
However, though the approach might yield satisfactory estimators of the parameters
$\beta_0,\ldots,\beta_P$, it does not give sound results for the estimation of the variance of the prior, $\sigma^2$, due to
confounding of the prediction error $\varepsilon$ with the estimation error in $\hat\theta$. Thus, though this
approach would allow empirical prediction of the initial ability estimate in a CAT procedure, it
would not give us a proper empirical estimate of the prior uncertainty needed for the update in
(9). Therefore, a better alternative is direct estimation of $\beta_0,\ldots,\beta_P$ and $\sigma$ from response data.
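The confounding can be made explicit with a short derivation. It rests on a simplifying assumption that is not stated in the text: the ability estimate decomposes as $\hat\theta_j = \theta_j + e_j$, where the estimation error $e_j$ has mean zero and variance $\tau^2$ and is uncorrelated with $\varepsilon_j$ and the predictors.

```latex
\hat\theta_j = \beta_0 + \beta_1 x_{j1} + \ldots + \beta_P x_{jP} + (\varepsilon_j + e_j),
\qquad
\operatorname{Var}(\varepsilon_j + e_j) = \sigma^2 + \tau^2 \ge \sigma^2 .
```

Under these assumptions the residual variance of the least-squares regression would estimate $\sigma^2 + \tau^2$ rather than $\sigma^2$, whereas the regression coefficients themselves would not be affected.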

Substitution of the regression equation in (2) into the 2-PL model in (1) for an
examinee with $X_1=x_1,\ldots,X_P=x_P$ gives the following logistic regression model:

$$P_i(\theta) = \mathrm{Prob}\{U_i = 1 \mid \theta\} = \frac{\exp[a_i(\beta_0 + \beta_1 x_1 + \ldots + \beta_P x_P + \varepsilon - b_i)]}{1 + \exp[a_i(\beta_0 + \beta_1 x_1 + \ldots + \beta_P x_P + \varepsilon - b_i)]}. \tag{10}$$



For $a_i=1$, $i=1,\ldots,I$, the model in (10) was discussed by Zwinderman (1991, 1997) as a
generalized Rasch model with manifest predictors. Following an approach in Rigdon and
Tsutakawa (1983), Zwinderman presents an EM algorithm for joint estimation of the item
difficulty and the regression parameters. The algorithm is adapted here to the case of the 2-PL
with item parameters known from previous item pool calibration. In addition, a discussion is
included on how to use the algorithm to estimate the regression parameters from data
collected in an operational adaptive testing program, in which responses to some
of the items in the pool are missing.

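Let $U_{ij}$, $i=1,\ldots,I$, $j=1,\ldots,N$, denote the response of examinee j on item i. For each
examinee, there is an unknown realization $\varepsilon_j$ of the error term $\varepsilon$ in (3), which is treated as
missing data in the EM algorithm. The values of the predictor variables $X_{j1}=x_{j1},\ldots,X_{jP}=x_{jP}$ are
treated as known parameters. For examinee j, let $p(\mathbf{u}_j \mid \mathbf{x}_j, \boldsymbol\beta, \varepsilon_j)$ be the probability of the
observed response vector $\mathbf{u}_j = (u_{1j},\ldots,u_{Ij})$ given predictor values $\mathbf{x}_j = (x_{j1},\ldots,x_{jP})$, parameter
vector $\boldsymbol\beta = (\beta_0,\ldots,\beta_P)$, and missing datum $\varepsilon_j$. In addition, $p(\varepsilon_j \mid \sigma)$ is the (normal) probability
density of $\varepsilon_j$. It follows that

$$p(\mathbf{u}_j \mid \mathbf{x}_j, \boldsymbol\beta, \varepsilon_j) = \prod_{i=1}^{I} P_i(\theta_j)^{u_{ij}} [1 - P_i(\theta_j)]^{1-u_{ij}}, \qquad \theta_j = \beta_0 + \beta_1 x_{j1} + \ldots + \beta_P x_{jP} + \varepsilon_j, \tag{11}$$

and

$$p(\varepsilon_j \mid \sigma) = (2\pi\sigma^2)^{-1/2} \exp(-\varepsilon_j^2 / 2\sigma^2). \tag{12}$$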

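The log-likelihood function associated with the complete data is given by

$$L(\boldsymbol\beta, \sigma; \mathbf{u}_j \mid \mathbf{x}_j, \varepsilon_j) = -\tfrac{1}{2} \ln(2\pi\sigma^2) - \varepsilon_j^2 / 2\sigma^2 + \sum_{i=1}^{I} \left[ u_{ij} \ln P_i(\theta_j) + (1 - u_{ij}) \ln(1 - P_i(\theta_j)) \right]. \tag{13}$$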

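The expectation of the complete-data log-likelihood over the posterior predictive distribution
of $\varepsilon_j$ is calculated. The density of this distribution is given by

$$p(\varepsilon_j \mid \mathbf{u}_j, \mathbf{x}_j, \boldsymbol\beta, \sigma) \propto p(\mathbf{u}_j \mid \mathbf{x}_j, \boldsymbol\beta, \varepsilon_j)\, p(\varepsilon_j \mid \sigma) \propto \frac{\exp\left[\varepsilon_j \sum_{i=1}^{I} a_i u_{ij} - \varepsilon_j^2 / 2\sigma^2\right]}{\prod_{i=1}^{I} \{1 + \exp[a_i(\beta_0 + \ldots + \beta_P x_{jP} + \varepsilon_j - b_i)]\}}. \tag{14}$$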
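The density in (14) is known only up to a normalizing constant, so in practice it has to be normalized numerically. The sketch below does this on a grid of values for $\varepsilon_j$; the grid approach and the function name are assumptions made for this illustration.

```python
import numpy as np


def eps_posterior(u, x, a, b, beta, sigma2, grid):
    """Normalized weights of the density (14) on a grid of eps values.

    u    : (I,) responses of one examinee (0/1, as floats)
    x    : (P+1,) predictor values with leading constant x_0 = 1
    a, b : (I,) known item parameters
    """
    theta = float(np.dot(beta, x)) + grid              # ability at each grid point
    z = a[:, None] * (theta[None, :] - b[:, None])     # shape (I, G)
    # log p(u | theta) under the 2-PL model: sum_i [u_i z_i - log(1 + exp(z_i))]
    loglik = (u[:, None] * z - np.logaddexp(0.0, z)).sum(axis=0)
    logpost = loglik - grid ** 2 / (2.0 * sigma2)      # add the log of (12) up to a constant
    w = np.exp(logpost - logpost.max())                # subtract the maximum for stability
    return w / w.sum()
```

Posterior moments of $\varepsilon_j$ then follow as weighted sums over the grid, for example `np.sum(w * grid ** 2)` for the second moment.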



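The expected complete-data log-likelihood for N examinees is equal to

$$E(L(\boldsymbol\beta, \sigma; \mathbf{u} \mid \mathbf{x})) = -\frac{N}{2} \ln(2\pi\sigma^2) - (1/2\sigma^2) \sum_{j=1}^{N} \int \varepsilon^2\, p(\varepsilon \mid \mathbf{u}_j, \mathbf{x}_j, \boldsymbol\beta, \sigma)\, d\varepsilon$$

$$\qquad + \sum_{j=1}^{N} \sum_{i=1}^{I} \int \left[ u_{ij} \ln P_i(\theta_j) + (1 - u_{ij}) \ln(1 - P_i(\theta_j)) \right] p(\varepsilon \mid \mathbf{u}_j, \mathbf{x}_j, \boldsymbol\beta, \sigma)\, d\varepsilon, \tag{15}$$

where $\mathbf{u} = (u_{ij})$.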

The algorithm consists of repeated application of expectation and maximization steps
until convergence. At step t the calculations are:

E-step. Calculation of the expected complete-data log-likelihood in (15) given the
values of $\sigma^{(t-1)}$ and $\beta_p^{(t-1)}$, p=0,...,P, calculated at step t-1.

M-step. Calculation of the values of the estimates $\sigma^{(t)}$ and $\beta_p^{(t)}$, p=0,...,P, maximizing
the expected complete-data log-likelihood from the E-step.
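As shown in the Appendix, the two steps boil down to iterative use of the following
recursive relations:

$$\left(\sigma^{(t)}\right)^2 = N^{-1} \sum_{j=1}^{N} \int \varepsilon^2\, p(\varepsilon \mid \mathbf{u}_j, \mathbf{x}_j, \boldsymbol\beta^{(t-1)}, \sigma^{(t-1)})\, d\varepsilon, \tag{16}$$

$$\sum_{j=1}^{N} x_{jp} \sum_{i=1}^{I} a_i u_{ij} = \sum_{j=1}^{N} x_{jp} \sum_{i=1}^{I} a_i \int \frac{\exp[a_i(\beta_0^{(t)} + \ldots + \beta_p^{(t)} x_{jp} + \ldots + \beta_P^{(t)} x_{jP} + \varepsilon - b_i)]}{1 + \exp[a_i(\beta_0^{(t)} + \ldots + \beta_p^{(t)} x_{jp} + \ldots + \beta_P^{(t)} x_{jP} + \varepsilon - b_i)]}\, p(\varepsilon \mid \mathbf{u}_j, \mathbf{x}_j, \boldsymbol\beta^{(t-1)}, \sigma^{(t-1)})\, d\varepsilon, \quad p = 0,\ldots,P, \tag{17}$$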

where $x_{j0} = 1$. Note that in the first equation the prior variance is equated to the average
posterior predicted variance. Likewise, the left-hand sums in (17) are equated to their posterior 
predicted expected values. Standard numerical procedures can be used to solve the equations. 
For a possible choice of procedure, see the empirical example below. 

Parameter estimation in operational CAT . The notation in the preceding section 
assumes that the examinees respond to all of the items in the pool. However, as already 
discussed, missing responses to items in an adaptive test can be ignored in ML estimation if 
no inferences are made with respect to the sampling distribution of the estimator. Thus, if the 
interest is only in point estimates of the parameters $\beta_0,\ldots,\beta_P$ and $\sigma^2$ and not in estimating
such quantities as their standard errors, data from an operational CAT program can be used to 
calculate these point estimates. In doing so, for each examinee the items not administered are 
simply omitted in the equations in (16)-(17). 

In addition, it is possible to use new responses to update previous estimates of the 
parameters by extending the sums in (16)-(17) over the new examinees and using the old 
estimates as the starting values in new iterations of the EM algorithm. 
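The recursions in (16)-(17), the omission of items not administered, and the restart from earlier estimates can be combined in one routine. The following is a minimal sketch and not the implementation used for the paper: it assumes Gauss-Hermite quadrature for the integrals, Newton steps for the regression parameters, and a hypothetical function name and argument layout.

```python
import numpy as np


def em_fit(U, X, a, b, mask=None, init=None, n_nodes=21, max_iter=500, tol=1e-6):
    """EM sketch for (16)-(17) with known 2-PL item parameters.

    U    : (N, I) array of 0/1 responses (entries with mask == 0 are ignored)
    X    : (N, P+1) predictor matrix whose first column is all ones (x_j0 = 1)
    a, b : (I,) known discrimination and difficulty parameters
    mask : (N, I) array, 1 where the item was administered (default: all)
    init : optional (beta, sigma2) starting values, e.g. earlier estimates
    """
    M = np.ones(U.shape) if mask is None else np.asarray(mask, dtype=float)
    U = np.where(M > 0, U, 0).astype(float)
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    if init is None:
        beta, sigma = np.zeros(X.shape[1]), 1.0
    else:
        beta, sigma = np.asarray(init[0], dtype=float), float(np.sqrt(init[1]))

    def logits(beta_vec, eps):
        theta = (X @ beta_vec)[:, None] + eps[None, :]   # (N, K)
        return a * (theta[:, :, None] - b)               # (N, K, I)

    for _ in range(max_iter):
        # E-step: posterior weights of eps_j at the quadrature nodes, cf. (14)
        eps = np.sqrt(2.0) * sigma * nodes
        z = logits(beta, eps)
        loglik = (M[:, None, :] * (U[:, None, :] * z - np.logaddexp(0.0, z))).sum(axis=2)
        logw = loglik + np.log(weights)
        W = np.exp(logw - logw.max(axis=1, keepdims=True))
        W /= W.sum(axis=1, keepdims=True)
        # M-step for sigma, cf. (16): average posterior expectation of eps^2
        sigma_new = float(np.sqrt((W * eps ** 2).sum(axis=1).mean()))
        # M-step for beta, cf. (17): Newton steps with the weights held fixed
        beta_new = beta.copy()
        for _newton in range(25):
            P = 1.0 / (1.0 + np.exp(-logits(beta_new, eps)))
            r = (W * (M[:, None, :] * a * (U[:, None, :] - P)).sum(axis=2)).sum(axis=1)
            h = (W * (M[:, None, :] * a ** 2 * P * (1.0 - P)).sum(axis=2)).sum(axis=1)
            step = np.linalg.solve(X.T @ (X * h[:, None]), X.T @ r)
            beta_new = beta_new + step
            if np.max(np.abs(step)) < 1e-10:
                break
        change = max(np.max(np.abs(beta_new - beta)), abs(sigma_new - sigma))
        beta, sigma = beta_new, sigma_new
        if change < tol:
            break
    return beta, sigma ** 2
```

Passing `mask` drops the items not administered from all sums, and passing the previous estimates as `init` together with the extended data corresponds to the updating scheme described above.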

Empirical Example 

Adaptive versions of Dutch translations of four subtests of the General Aptitude Test
Battery (GATB) were studied in Schoonman (1989). In this study, the subtest Name
Comparison was first administered to the examinees in the sample (N=306) and total response
time was recorded. Responses on the next subtest, Vocabulary, were used to estimate the
ability of the examinees measured by this subtest. Schoonman reported a linear correlation 
between the log response times on the first test and the estimated abilities on the second test 
equal to -.46. 

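The data set was reanalyzed to estimate both the parameters $\beta_0$ and $\beta_1$ in the
regression equation of the true Vocabulary ability on the log response time and the uncertainty
parameter $\sigma$ directly from the data. The equations in (17) were solved for p=0,1 using Newton's
method, which gives the updates

$$\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} \leftarrow \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} - \begin{pmatrix} \dfrac{\partial^2 E(L)}{\partial \beta_0^2} & \dfrac{\partial^2 E(L)}{\partial \beta_0\, \partial \beta_1} \\ \dfrac{\partial^2 E(L)}{\partial \beta_0\, \partial \beta_1} & \dfrac{\partial^2 E(L)}{\partial \beta_1^2} \end{pmatrix}^{-1} \begin{pmatrix} \dfrac{\partial E(L)}{\partial \beta_0} \\ \dfrac{\partial E(L)}{\partial \beta_1} \end{pmatrix}. \tag{18}$$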



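The first derivatives in (18) are given in (A.5), whereas the second derivatives can be shown
to be equal to

$$\frac{\partial^2 E(L)}{\partial \beta_0^2} = -\sum_{j=1}^{N} \sum_{i=1}^{I} a_i^2 \int \frac{\exp[a_i(\beta_0^{(t)} + \beta_1^{(t)} x_j + \varepsilon - b_i)]}{\{1 + \exp[a_i(\beta_0^{(t)} + \beta_1^{(t)} x_j + \varepsilon - b_i)]\}^2}\, p(\varepsilon \mid \cdot)\, d\varepsilon, \tag{19}$$

$$\frac{\partial^2 E(L)}{\partial \beta_1^2} = -\sum_{j=1}^{N} \sum_{i=1}^{I} a_i^2 x_j^2 \int \frac{\exp[a_i(\beta_0^{(t)} + \beta_1^{(t)} x_j + \varepsilon - b_i)]}{\{1 + \exp[a_i(\beta_0^{(t)} + \beta_1^{(t)} x_j + \varepsilon - b_i)]\}^2}\, p(\varepsilon \mid \cdot)\, d\varepsilon, \tag{20}$$

$$\frac{\partial^2 E(L)}{\partial \beta_0\, \partial \beta_1} = -\sum_{j=1}^{N} \sum_{i=1}^{I} a_i^2 x_j \int \frac{\exp[a_i(\beta_0^{(t)} + \beta_1^{(t)} x_j + \varepsilon - b_i)]}{\{1 + \exp[a_i(\beta_0^{(t)} + \beta_1^{(t)} x_j + \varepsilon - b_i)]\}^2}\, p(\varepsilon \mid \cdot)\, d\varepsilon, \tag{21}$$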



and $p(\varepsilon \mid \cdot)$ follows from (14). The integrals in (16) and (18) were calculated using Gauss-Hermite
quadrature. Several sets of starting values were tried, all resulting in the following
estimates: $\hat\beta_0 = 5.833$, $\hat\beta_1 = -1.279$, and $\hat\sigma^2 = .986$. Thus, the best way to initialize the adaptive
procedure for the Vocabulary test is to use $\hat\theta = 5.833 - 1.279x$ or to use $N(5.833 - 1.279x, .986)$ as
a prior distribution for $\theta$.
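With the reported estimates, the initialization for a new examinee reduces to a one-line computation. The sketch below is illustrative only: the function name is hypothetical, and the use of the natural logarithm is an assumption, because the text specifies neither the base of the logarithm nor the unit of the response times.

```python
import math

# Estimates reported in the empirical example
BETA_0, BETA_1, SIGMA2 = 5.833, -1.279, 0.986


def initial_ability(total_response_time):
    """Return the prior mean and variance for the Vocabulary test.

    Assumes (not confirmed by the text) that x is the natural logarithm of
    the total response time on Name Comparison, in the unit used when the
    regression parameters were estimated.
    """
    x = math.log(total_response_time)
    return BETA_0 + BETA_1 * x, SIGMA2
```

The first element of the result can be used as the initial point estimate, and both elements together define the normal prior.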
The standard deviation of the log response times was calculated as .433. It follows that
the correlation between the true abilities and the log response times can be estimated as -.59.
The difference between this result and Schoonman's estimate of -.46 indicates the loss of
information incurred when estimating the regression parameters from the estimated abilities
rather than the true abilities using the model in this paper.
Appendix



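The equation for the estimator of $\sigma^2$ is found by setting the derivative of the expected
complete-data log-likelihood in (15) with respect to $\sigma$ equal to zero. As only the first two terms are needed,

$$\frac{\partial}{\partial \sigma} \left[ -\frac{N}{2} \ln(2\pi\sigma^2) - (1/2\sigma^2) \sum_{j=1}^{N} \int \varepsilon^2\, p(\varepsilon \mid \mathbf{u}_j, \mathbf{x}_j, \boldsymbol\beta, \sigma)\, d\varepsilon \right] = -\frac{N}{\sigma} + \sigma^{-3} \sum_{j=1}^{N} \int \varepsilon^2\, p(\varepsilon \mid \mathbf{u}_j, \mathbf{x}_j, \boldsymbol\beta, \sigma)\, d\varepsilon = 0. \tag{A.1}$$

Multiplying by $\sigma^3/N$ and rearranging yields

$$\sigma^2 = N^{-1} \sum_{j=1}^{N} \int \varepsilon^2\, p(\varepsilon \mid \mathbf{u}_j, \mathbf{x}_j, \boldsymbol\beta, \sigma)\, d\varepsilon. \tag{A.2}$$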



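Likewise, the equations for $\beta_p$, p=0,...,P, are found by setting the derivative of the last
term of (15) equal to zero:

$$\frac{\partial}{\partial \beta_p} \left[ \sum_{j=1}^{N} \sum_{i=1}^{I} \int \left[ u_{ij} \ln P_{ij} + (1 - u_{ij}) \ln(1 - P_{ij}) \right] p(\varepsilon \mid \mathbf{u}_j, \mathbf{x}_j, \boldsymbol\beta, \sigma)\, d\varepsilon \right] = 0. \tag{A.3}$$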



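Because

$$\frac{\partial P_{ij}}{\partial \beta_p} = a_i x_{jp} P_{ij} (1 - P_{ij}), \tag{A.4}$$

it follows that

$$\sum_{j=1}^{N} \sum_{i=1}^{I} \int a_i x_{jp} (u_{ij} - P_{ij})\, p(\varepsilon \mid \mathbf{u}_j, \mathbf{x}_j, \boldsymbol\beta, \sigma)\, d\varepsilon = 0. \tag{A.5}$$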
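Hence,

$$\sum_{j=1}^{N} \sum_{i=1}^{I} a_i x_{jp} u_{ij} = \sum_{j=1}^{N} \sum_{i=1}^{I} a_i x_{jp} \int \frac{\exp[a_i(\beta_0 + \beta_1 x_{j1} + \ldots + \beta_P x_{jP} + \varepsilon - b_i)]}{1 + \exp[a_i(\beta_0 + \beta_1 x_{j1} + \ldots + \beta_P x_{jP} + \varepsilon - b_i)]}\, p(\varepsilon \mid \mathbf{u}_j, \mathbf{x}_j, \boldsymbol\beta, \sigma)\, d\varepsilon. \tag{A.6}$$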